1. Introduction

Our group conducts the data exploration and visualization of weather events over the United States. Our analysis includes Event-Type of weather, Duration of events, Magnitude of events, Injuries&Deaths and damage caused by events and time series analysis of Year Trend (2000-2017) regarding different features. Our static graphs are presented in the main analysis. Our shiny graphs(https://edav-project-weather-event-analysis.shinyapps.io/WeatherEvent/) and D3 plots (https://blockbuilder.org/Inhenn/4d6a10d9feefbd5132ceb8beeaaf5c30 and https://blockbuilder.org/Inhenn/dbf9c470f6f61d159a4c17c8c0356e85) are presented in the interactive component.

2. Description of Data

Dataset Storm Data

This dataset comes from the Storm Events Database of National Centers for Environmental Information. We get it from the website: https://www.ncdc.noaa.gov/stormevents/ftp.jsp The first dataset focused on the year 2017, which summarized various of varaibles regarding the storm events, for example, categorical data like month name and event type and numerical data including the number of injuries&death and damage property.

  • This dataset consists of following important features:
    • BEGIN_YEARMONTH: The begin time of weather event format in ‘year’ and ‘month’
    • BEGIN_DAY: The begin day of weather event
    • END_YEARMONTH: The end time of weather event format in ‘year’ and ‘month’
    • END_DAY: The end day of weather event
    • STATE: States where events took place
    • YEAR: The year that events took place
    • MONTH_NAME: Name of the month for the event in this record (spelled out; not abbreviated)
    • EVENT_TYPE: Ex: Hail, Thunderstorm Wind, Snow, Ice (spelled out; not abbreviated)
    • CZ_TYPE: Ex: C, Z , M, Indicates whether the event happened in a (C) county/parish, (Z) zone or (M) marine
    • BEGIN_DATE_TIME: The time which specific event begins
    • CZ_TIMEZONE: Eastern Standard Time (EST), Central Standard Time (CST), Mountain Standard Time (MST), etc.
    • END_DATE_TIME: The time which specific event ends
    • INJURIES_DIRECT: The number of injuries directly related to the weather event
    • INJURIES_INDIRECT: The number of injuries indirectly related to the weather event
    • DEATHS_DIRECT: The number of deaths directly related to the weather event
    • DEATHS_INDIRECT: The number of deaths indirectly related to the weather event.
    • DAMAGE_PROPERTY: The estimated amount of damage to property incurred by the weather event.
    • DAMAGE_CROPS: The estimated amount of damage to crops incurred by the weather event
    • MAGNITUDE: Measured extent of the magnitude type ~ only used for wind speeds and hail size
    • MAGNITUDE_TYPE: EG = Wind Estimated Gust; ES = Estimated Sustained Wind; MS = Measured Sustained Wind; MG = Measured Wind Gust (no magnitude is included for instances of hail)
    • FLOOD_CAUSE: Reported or estimated cause of the flood

Dataset Storm Data over years

The second dataset consisted of the storm data like above but covered year 2000 - year 2017.

3. Analysis of Data Quality

data=read.csv('StormEvents_details-ftp_v1.0_d2017_c20181017.csv')
visna(data)

The overall quality of our dataset is excellent, as it is cited from the website of Natianal Centers for Environmental Information. The data is very clean and free of typos. However, there are still some missing values according to the NA plot above (the label display is not clean, but since this is a built-in plot, we cannot adjust the label size). As is shown in the graph, the most frequent missing values are CATEGORY, TOR_F_SCALE, TOR_OTHER_CZ_FIPS. According to the data description file, CATEGORY is a redundent feature, where all values are missing. “TOR_F_SCALE” and “TOR_OTHER_CZ_FIPS” are features for Tornados. So, if the event type is not Tornado, it is expected that these two values will be missing. The other common missing features are BEGIN_LAT/LON and END_LAT/LON, because many storm events are hard to determine the beginning and ending location. In conclusion, this dataset is very clean and reliable in spite of some missing values. In our following data analysis and visualization, we avoided most of the missing-value columns, and most our work is based on relative complete features.

4. Main analysis (Exploratory Data Analysis)

4.1 Death and Injuries

The number of death and injuries directly related to the weather event

data=read.csv('injury_death.csv')
ggplot(data,aes(x=reorder(EVENT_TYPE, INJURIES_DEATH, median),y=INJURIES_DEATH,group=EVENT_TYPE)) +
  geom_boxplot() +
  scale_y_continuous(labels=comma) +
  coord_flip(ylim=c(0,70)) +
  ggtitle('The number of injuries and death group by event type') 

Here, the above graph is box plots on direct&indirect injuries and death for different event type (Ex: Hail, Thunderstorm Wind, Snow, Ice) and order them by mdedian injuries and death number. As can be seen in the above graph, injuries and death caused by Lake-Effect Snow has the highest mdeian value. The top five injuries and death for median are Lake-Effect Snow, Waterspout, ice storm, heat and excessive heat. Also, we can observe that the number of injuries has some extreme value with tornado and hurricane, which may infer that tornado and hurricane are more relatively dangerous.

The number of death and injuries directly related to the state

state_data = data %>% group_by(STATE) %>% summarise(value=sum(INJURIES_DEATH))
state=data.frame(region =state_data$STATE,value=state_data$value)
state$region=tolower(state$region)
ggplot(state,aes(x=reorder(region,value),y=value)) +
  geom_bar(stat = 'identity',fill='darkslateblue') +
  geom_text(aes(label = round(value,3),hjust=-0.5), size = 3.5) +
  xlab("State") + 
  ylab("Injuries and Death") +
  coord_flip() +
  ggtitle('2017 Injuries abd Death by State') +
  theme(axis.title.x = element_text(size = 15, family = "myFont", face = "bold"),axis.title.y = element_text(size = 15, family = "myFont", face = "bold"),axis.text=element_text(size=8,vjust = 0.5, hjust = 0.5, angle = 0))

The bar plot above show the exact number of death and injuries for each state. We can see here that Colorado has the highest number of 511 death and injuries people while Vermontand Maine has the lowest number of only one.

The number of death and injuries directly related to the month

theme_dotplot=theme_bw(18) +
    theme(axis.text.y = element_text(size = rel(.75)),
          axis.ticks.y = element_blank(),
          axis.title.x = element_text(size = rel(.75)),
          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(size = 0.5),
          panel.grid.minor.x = element_blank())
injury_by_month=data %>% group_by(MONTH_NAME) %>% 
  summarize(number_of_event = n())
ggplot(injury_by_month,aes(x=number_of_event,y=reorder(MONTH_NAME,number_of_event))) + 
         geom_point(col='blue',size=3) +
         theme_dotplot +
         ggtitle('The number of event triggering injuries and death group by months')

The above graph is the dot plot about the number of event triggering injuries and death group by 12 months and I sort it by number. As can be seen in the graph, total number of event in January is the highest amoung 12 months. We can infer that January is the month where natural disaster most frequently occurs. Similarly, we can predict that November is the most peaceful time. Also, we can infer that January, July and August are top 3 dangerous month in a year.

4.2 Duration Analysis

The data includes multiple kinds of weather events in the US. Differernt kinds of weather event will have significantly different duration time. The duration time is computed by the start time and end time of each observation and counted in minutes. Hence, according to the event type, draw the below histograms. Since that the range of duration differs too much, we have to free the scales for each plot. Each histogram can be easily reached in our shiny app #IP address. The histogram of duration for each event type

data_withevent = read.csv("./Data/duration_histogram_data.csv")

ggplot(data_withevent, aes(DURATION)) +
  geom_histogram(bins = 60, fill = "lightblue", color = "blue") +
  ggtitle("Histogram of Duration by Event type")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab("Frequency")+
  facet_wrap(~EVENT_TYPE, scales = "free")

In order to know the average duration for each kind of event, I draw the ClevelandDotPlot to show duration differences between each other. The difference of the mean of duration

duration_mean_event = read.csv("./Data/duration_mean_data.csv")
theme_dotplot <- theme_bw(15) +
    theme(axis.text.y = element_text(size = rel(.75)),
          axis.ticks.y = element_blank(),
          axis.title.x = element_text(size = rel(.75)),
          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(size = 0.5),
          panel.grid.minor.x = element_blank())+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(duration_mean_event) +
  geom_point(aes(mean, fct_reorder(EVENT_TYPE, mean)), size = 2, alpha = 0.8, color = "blue") +
  theme_dotplot +
  ylab("EVENT_TYPE")+
  xlab("Mean of Duration") + 
  ggtitle("Mean of Duration over Event_types")

From the plot we can see that drought and flood usually last for long time, approximately 20-30 days. However, Wind, Hail and Tornado usually last for short time. For Tornado, the data includes any small Tornado which might influence the mean of the duration. Based on the mean distribution, we can classify the event into four groups, short, short-medium, medium-long and long duration. For each group, draw the below boxplot to see the difference of each group.

data_withevent = read.csv("./Data/duration_boxplot_data.csv")

h1 <- ggplot(data_withevent[data_withevent$dlevel == "d1",], aes(reorder(EVENT_TYPE, -DURATION, FUN = median), DURATION)) +
  geom_boxplot(aes(group = EVENT_TYPE), fill = "lightgray", color = "black") +
  ggtitle("Histogram of Duration")+
  xlab("EVENT_TYPE")+
  ylab("DURATION")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(0,20)+
  coord_flip()
h2 <- ggplot(data_withevent[data_withevent$dlevel == "d2",], aes(reorder(EVENT_TYPE, -DURATION, FUN = median), DURATION)) +
  geom_boxplot(aes(group = EVENT_TYPE), fill = "lightgray", color = "black") +
  ggtitle("Histogram of Duration")+
  xlab("EVENT_TYPE")+
  ylab("DURATION")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(0,1000)+
  coord_flip()
h3 <- ggplot(data_withevent[data_withevent$dlevel == "d3",], aes(reorder(EVENT_TYPE, -DURATION, FUN = median), DURATION)) +
  geom_boxplot(aes(group = EVENT_TYPE), fill = "lightgray", color = "black") +
  ggtitle("Histogram of Duration")+
  xlab("EVENT_TYPE")+
  ylab("DURATION")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(0,4000)+
  coord_flip()
h4 <- ggplot(data_withevent[data_withevent$dlevel == "d4",], aes(reorder(EVENT_TYPE, -DURATION, FUN = median), DURATION)) +
  geom_boxplot(aes(group = EVENT_TYPE), fill = "lightgray", color = "black") +
  ggtitle("Histogram of Duration")+
  xlab("EVENT_TYPE")+
  ylab("DURATION")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(0,70000)+
  coord_flip()
grid.arrange(h1,h2,h3,h4,ncol = 1)

The duration time directly related to the injuries and deaths

data_withinjure = read.csv("./Data/duration_injure_data.csv")
ggplot(data_withinjure, aes(DURATION, INJURE_DEATH)) +
  geom_point(size = 1, alpha = 0.5) +
  facet_wrap(~EVENT_TYPE, scales = "free") +
  theme_classic(6.5) +
  geom_smooth(method = "lm", se = FALSE,size = 0.5, color = "red") +
  geom_density_2d(alpha = 0.6) +
  ylab("Injures and Deaths") +
  ggtitle("Relationship between duration and injures and deaths")+
  theme(plot.title = element_text(hjust = 0.5))

Seen from the facet graph, usually the longer the duration was, the more injuries and deaths happened. The relationship also differs by event type.

4.3 Magnitude Analysis

Since only a few kinds of event can be measured by magnigtude. We choose “Hail”, “Strong Wind”, “High Wind”, “Marine Thunderstorm Wind” and “Thunderstorm Wind” for analysis. Draw the histogram of magnitude of the five kinds of events.

data_withmagnitude = read.csv("./Data/magnitude_histogram_data.csv")
ggplot(data_withmagnitude, aes(MAGNITUDE)) +
  geom_histogram(bins = 60, fill = "lightblue", color = "blue") +
  ggtitle("Histogram of Magnitude")+
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab("Frequency")+
  facet_wrap(~EVENT_TYPE, scales = "free")

Analyze the mean of the magnitude of the five kinds of event type

magnitude_mean_event = read.csv("./Data/magnitude_mean_data.csv")
ggplot(magnitude_mean_event) +
  geom_point(aes(mean, fct_reorder(EVENT_TYPE, mean)), size = 2, alpha = 0.8, color = "blue") +
  theme_dotplot +
  ylab("EVENT_TYPE")+
  xlab("Mean of Magnitude") + 
  ggtitle("Mean of Magnitude over Event_types")

Seen from the graph, usually “Thunderstorm Wind” has the highest magnitude, while “Hail” has the lowest magnitude.

4.4 Tornado

Bseides, we did an insteresting topic related to “Tornado” event. Fist, we analyze the occurrence of Tornado over states.

Tornado_state = read.csv("./Data/Tornado_state.csv")
ggplot(Tornado_state,aes(x=reorder(region,value),y=value)) +
  geom_bar(stat = 'identity',fill='darkslateblue') +
  geom_text(aes(label = round(value,3),hjust=-0.2), size = 3) +
  xlab("State") + 
  ylab("Frequency of Tornado") +
  coord_flip() +
  ggtitle('Tornado distribution over US')+
  theme(axis.title.x = element_text(size = 12, face = "bold"),axis.title.y = element_text(size = 15, face = "bold"),axis.text=element_text(size=6,vjust = 0.5, hjust = 0.5, angle = 0))

Spatial distribution,

g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa')
  
)
plot_geo(Tornado_state, locationmode = 'USA-states') %>%
  add_trace(
    z = ~value, locations = ~code,
    color = ~value, colors = colorRampPalette(brewer.pal(11,"RdPu"))(100)
  ) %>%
  colorbar(title = "Frequency") %>%
  layout(
    title = '2017 USA Tornado Frequency by State',
    geo = g
  )

Hence, Tornado happens most frequently in TEXAS, GEORGIA, OKLAHOMA, and LOUISIANA.

4.5 Damage

data = read.csv("./StormEvents_details-ftp_v1.0_d2017_c20181017.csv")
a=as.numeric(sub("K", "e3", data$DAMAGE_PROPERTY, fixed = TRUE))
b=as.numeric(sub("M", "e6", data$DAMAGE_PROPERTY, fixed = TRUE))
a[is.na(a)]=b[is.na(a)]
data$DAMAGE_PROPERTY=a
data$DAMAGE_PROPERTY[is.na(data$DAMAGE_PROPERTY)]=0

a=as.numeric(sub("K", "e3", data$DAMAGE_CROPS, fixed = TRUE))
b=as.numeric(sub("M", "e6", data$DAMAGE_CROPS, fixed = TRUE))
a[is.na(a)]=b[is.na(a)]
data$DAMAGE_CROPS=a
data$DAMAGE_CROPS[is.na(data$DAMAGE_CROPS)]=0
data = data[data$DAMAGE_PROPERTY!=0,]

data$code = state.abb[match(data$STATE,toupper(state.name))]

data = data[complete.cases(data$code),]
data$LOG_DAMAGE_PROPERTY = log10(data$DAMAGE_PROPERTY)

data$BEGIN_DATE_TIME = dmy_hms(data$BEGIN_DATE_TIME)

We parse K/M to numbers and delete 0 values since we care about event that cause property damage. Transforming the damage amount to log10 based, we can see more clearly the detail of each graph. (Also cleaned and transformed damage of crops for later use)

4.5.1 Property damage by state in map

df <- aggregate(data$DAMAGE_PROPERTY,by=list(code=data$code), FUN=sum)
df$logx = log10(df$x)
l <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa')
  
)

p1 <- plot_geo(df, locationmode = 'USA-states') %>%
  add_trace(
    z = ~logx, text = ~paste("Property Damage:", x), locations = ~code,
    color = ~logx, colors = 'Purples'
  ) %>%
  colorbar(title = "10^ USD") %>%
  layout(
    title = '2017 USA Storm Damage by State',
    geo = g
  )

p1

In this graph, we used heat map to show log property damage amount by state. It clearly shows coastal areas tend to have the more property damage than continental areas. CA, TX, FL are the three states that have the highest property damage.

4.5.2 Property damage vs Crop damage log based

df = data[data$DAMAGE_CROPS!=0,]
df$LOG_DAMAGE_CROPS = log10(df$DAMAGE_CROPS)
p3 <- ggplot(df, aes(LOG_DAMAGE_CROPS, LOG_DAMAGE_PROPERTY)) + geom_point(alpha = .4) + geom_density_2d(bins = 30)+stat_smooth(method = "lm", col = "red")
ggplotly(p3)

We thought that the log property damage and log crop damage may have a linear relationship. Thus, we ploted them in a graph and fit a linear model on it. The graph confirmed our assumption. There is a strong linear relation between log property damage and log crop damage. From the density contours, we can see that the data points highly concentrate on some clusters.

p3 <- plot_ly(
  df, x = ~LOG_DAMAGE_PROPERTY, y = ~LOG_DAMAGE_CROPS,
  text = ~paste("Property Damage: ", DAMAGE_PROPERTY, '$<br>Crop Damage:', DAMAGE_CROPS),
  color = ~LOG_DAMAGE_PROPERTY, size = ~LOG_DAMAGE_PROPERTY
)
p3

In this graph, we can more clearly see the data points in an equal scale.

4.5.3 Property damage by month

accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}
ag = aggregate(data$DAMAGE_PROPERTY,by=list(Category=data$MONTH_NAME), FUN=sum)
ag$Category = factor(ag$Category, levels = month.name)
ag = ag[order(ag$Category),]
ag$ID <- seq.int(nrow(ag))

ag = ag %>% accumulate_by(~ID)

p4 <- ag %>%
  plot_ly(
    x = ~Category, 
    y = ~x,
    frame = ~frame,
    type = 'scatter',
    mode = 'lines', 
    line = list(simplyfy = F)
  ) %>% 
  layout(
    xaxis = list(
      title = "Date",
      zeroline = F
    ),
    yaxis = list(
      title = "Property Damage",
      zeroline = F
    )
  ) %>% 
  animation_opts(
    frame = 100, 
    transition = 50, 
    redraw = FALSE
  ) %>%
  animation_slider(

    )%>%
  animation_button(
    x = 1, xanchor = "right", y = 0, yanchor = "bottom"
  )
p4

In this animated graph, we illustrate the total property damage by month. There is a spike in August and September, which we think is probably because it’s the season for hurricane. The peak is in August with 7.35 billion loss, which is about 30 times more than in July.

5 Executive summary

Death and Injuries

The spatial visualization is one of the most macroscopic ways for us to draw the insight of the number of death and injuries crossing 50 states. The bar charts above can only be able to show the ranking of the number of death and injuries by giving us the name of state while the map is able to illustrate on the spatial variations.

library(choroplethr)
library(choroplethrMaps)
state_choropleth(state, title="2017 Injuries abd Death by State", legend="Injuries and Death")

As can be seen in the above map about the number of death and injuries by state, CA, NV, CO, TX, MO, GA, FL have the relatively higher death and injuries number among all the state and they locate in the range of [125,511]. In the meantime, ID, WY, ND, VT, ME have the relatively lower death and injuries number among all the state and they locate in the range of [1,4].

Annual Storm Event Summary

Annual Storm Event Summary

(The whole graph can be find here https://blockbuilder.org/Inhenn/4d6a10d9feefbd5132ceb8beeaaf5c30) In the past decades, the earth is said to become warmer and warmer. With the increase of global temperature, we would like to know if there is also an increase in the storm events we get in United States. The graph above is the screen shot of a d3 animated bar chart. The blue line is the annual total storm events from 2000 to 2017, while the orange line is the annual direct death caused by these events. According to this graph, we did not see any significant increase in storm event numbers throughout the years.

Analyzing the mean of injuries and deaths of each event type, we have the below graph.

injure_mean_event = read.csv("./Data/injure_mean_event.csv")
ggplot(injure_mean_event) +
  geom_point(aes(mean, fct_reorder(EVENT_TYPE, mean)), size = 2, alpha = 0.8, color = "blue") +
  theme_dotplot +
  ylab("EVENT_TYPE")+
  xlab("Mean of Injures and Deaths") + 
  ggtitle("Mean of Injures and deaths over Event_types")

Seen from the graph, we can see that “Hurricanne”, “Excessive Heat” and “Tornado” causes a lot of injuries and deaths. “Flood” and “Wind” usually causes small amoung of injuries and deaths.

The duration time directly related to the magnitude

data_withmagnitude = read.csv("./Data/duration_magnitude_data.csv")
ggplot(data_withmagnitude, aes(DURATION, MAGNITUDE)) +
  geom_point(size = 1, alpha = 0.25) +
  facet_wrap(~EVENT_TYPE, scales = "free") +
  theme_classic(8) +
  geom_smooth(method = "lm",size = 0.5, color = "red") +
  geom_density_2d(alpha = 0.6) +
  ylab("MAGNITUDE") +
  ggtitle("Relationship between Duration and Magnitude")+
  theme(plot.title = element_text(hjust = 0.5))

From the scatter plots, we can see that for the five kinds of event, duration is positively related to magnitude.

Cluster on map

df = data[complete.cases(data$BEGIN_LAT),]
df$LOG_DAMAGE_CROPS = log10(df$DAMAGE_CROPS)
leaflet(df)%>% addTiles() %>% addMarkers(
  ~BEGIN_LON,~BEGIN_LAT,popup = ~paste(BEGIN_DATE_TIME, EVENT_TYPE,"Property Damage:", DAMAGE_PROPERTY, sep = "<br />"),
                                        
  clusterOptions = markerClusterOptions()
)

This graph shows all storm occurred by clusters in regions. Using zoom in, you can subset the clusters into smaller clusters in zoomed map. Finally, you can see each storm event on the map and click to see the detail of each event. This gives us a more clear view of the allocation of each storm event.

Some other interesting explorations

flood=read.csv('flood.csv')
ggplot(flood,aes(x=FLOOD_CAUSE,fill=EVENT_TYPE)) +
  geom_bar(position = 'dodge') +
  xlab('The cause of flood') +
  ylab('The number of events') +
  scale_fill_discrete(name = "Flood Type") +
  theme(plot.title = element_text(size = 20),
    legend.title=element_text(size=15), 
    legend.text=element_text(size=15)) +
  theme(text = element_text(size=15))+
  theme(legend.position = 'bottom') +
  theme(panel.background = element_blank()) +
  ggtitle('The relationship between the cause and the type of flood')

We can see the graph here illustrating the relationship between the cause and the type of flood. Heavy rain seems to be the main cause of three types of flood, namely Debris Flow, Flash Flood and Flood.

Tornado_month_data = read.csv("./Data/Tornado_month_data.csv")

ggplot(Tornado_month_data) +
  geom_point(aes(value, fct_reorder(MONTH_NAME, value)), size = 2, alpha = 0.8, color = "blue") +
  theme_dotplot +
  ylab("Occuring Month")+
  xlab("Frequency of Month") + 
  ggtitle("2017 Monthly Frequency of Tornado")

From the dotplot, you can usually, Tornado happens most frequently in May and least frequently in December.

Furthermore, we draw the year trend of Tornado damage and frequency from 2000 to 2017. The dynamic graph can be seen in the link https://blockbuilder.org/Inhenn/dbf9c470f6f61d159a4c17c8c0356e85.

6 Interactive component

In order to visuilize the analysis, we built a shiny app to intergrate all of our analysis and a d3 html for show the time series data. Shiny link: https://edav-project-weather-event-analysis.shinyapps.io/WeatherEvent/ Html link(Storm Event Summary): https://blockbuilder.org/Inhenn/4d6a10d9feefbd5132ceb8beeaaf5c30 Html link(Year Trend of Tornado): https://blockbuilder.org/Inhenn/dbf9c470f6f61d159a4c17c8c0356e85.

In Shiny, we built 6 tabs with “Event type”, “Duration Analysis”, “Magnitude Analysis”, “Event Topics” and “Damage Summary”. They use interactive components to show the data distribution.

In the first tab in shiny, we present a brief summarized graph for top 5 popular storm event type for each state, and top 5 popular state for each storm event. While working on a storm event dataset, the first question comes to our mind is what is the most popular storm event in each state. Therefore, we created this interactive graph, which enables user to select their favourite state and check the top 5 most frequent storm event in that state. Once a state is selected in the filter window, user will see the top 5 most popular event with corresponding frequency in the bar chart. Similarly, we also provide a bar chart for user to explore the top 5 most frequently visited states for each event type. When user select a specific event type, the top 5 popular states will show in the bar graph. We think this charts will help user to get a brief idea of the storm event distribution in each states, and hopefully user will get some suggestions about which place is the best to avoid storm disasters.

In the tab “Casualties”, we build a shiny interactive graph regarding the number od death and injuries caused by different event type from 2011-2017. We can change the event type labels here to observe yearly patterns of casualties in various of weather types. We can find something noteworthy here that the extreme weather tornado caused so many casualties in 2011 but stayed very low in recent 6 years. This pattern accorded with the 2011 Super Outbreak, which was the largest, costliest and one of the deadliest tornado outbreaks ever recorded. What’s more, winter weather caused the highest number death and injuries in 2014 among 2011-2017, which accored with the 2014 North American cold wave–an extreme weather event that extended through the late winter months of the 2013–2014 winter season.

In the tab “Property damage analysis”, under title “US Storm Property Damage by Factors”, we listed for factors that we think are high associated with property damage. First, in event type analysis, we found tropical storm, hurricane and flood cause most of the property loss. Second, specific to tornado caused property loss, we found the f-scale has a positive relation with property damage except for EF4 which is the highest level in f-scale. In August and September, the storms cause more property damage.There are a couple states that have the most of the property loss, which includes TX, FL, CA and MI. Under title “US Storm Property Damage by State”, we can see the log property damage histogram by each state. Some state don’t have a pattern, most of the states have a normal or binormal pattern. Under title “Histogram of Log Property Damage with Changing Bins”, we can change the bins in the property damage histogram to see how the distribution looks like in different bin width.

In the “Html link(Storm Event Summary)” we can see a d3 ploy for Storm Event Year Trend. Most our data analysis have been focusing on the dataset in 2017. However, there are also datasets for previous years in our source website. Utilizing these datasets, we can present the time trend for storm events in United States.The graph is made in d3. User can go to the website using the link above.

To get a full view of the graph, please click the “GO into fullscreen” button on the upper-left edge of your browser.

The animation displays the time trend of total storm events from 2000 to 2017. On the top there are two buttons where we can select either direct deaths or direct injuries we want to plot and compare. So, in this section, we are going to combine the storm datasets from 2000-2017, and count the total recorded storm events for each year, along with the directed deaths and injuries caused by these event. Meanwhile, the direct deaths and injuries have a strong correlation with the total storm events as expected. However, in year 2005 we can see a peak of Direct Death but we didn’t observe any peak in storm event of that year. We looked up the event online, and we get that in 2005 there was a horrible Hurricane “Katrina” which caused more 1000 deaths.

In the Duration and Magnitude Analysis part, from the faceted histogram, we can see that different event type has significantly different duration time and magnitude. It is not convenient to put them in the same scales for comparison. It is better to built a choice for event type to show the exact histogram of duation or magnitude of that event type. Hence, we created the shiny histogram, where you choose which event type you like to see. Besides, on each event, the histogram can show in its own scale, which will show the exact distribution of the data.

Moreover, in Duration Analysis, we have divided the event type into 4 groups based on duration time. It would better to create boxplots in the scale of each group. Hence, we created the boxplot with group choices in shiny app. Similarly, for convenience of choosing each event type, we changed the original scatter plot into shiny scatter plots with choice of event types. It is easier to see the relationship between casualties with duration of each event type.

In the Tornado topic part, for better show the exact number of frequency of tornado of each state, we created the shiny choropleth map. In addition, we can analyze total damage, mean damage and frequency in one graph by choosing different options. We can easily see that in 2011 the damage of tornado reaches the peak value.

7 Conclusion

Duration time for different events differs dramatically. According to our analysis, we divided the events into 4 groups based on duration time, which is showed in the above bosplot.

We tried to analyze the relationship between duaraion with injuries and deaths. We found that they are slightly correlated. For most events, they are positively related. More duration time means more injuries in the event.

For weather events with property magnitude, duration is positively related to magnitude. Events with more duation time usually have larger magnitude.

For tornado topic, tornado happens more frequently in the middle and south area of the US, where Texas is the state with most tornado occurrences. The west land has very low frequency of tornado. Besides, tornado happens most in May, while least in December.

For the part of casualties analysis in 2017, we can draw the conclusion that Lake-Effect Snow, Waterspout, ice storm, heat and excessive heat are top 5 dangerous weather event while tornado and hurricane can cause extreme case. In the mean time, Colorado, Texas, Florida, Georigia and California are top 5 dangerous state with relatively higher number of injuries and death. Moreover, we can find that January, July, August and March are top 4 months with high weather event likelihood.

In the analysis of property damage, we found the property damage is highly associated with location, crop damage, event type and time of the year. We found that costal regions are more suffered from property loss. Also, in August and Septembe, which are hurricane seasons, the storm are more frequently and cause more damage. Crop damage has a log linear relation with property damage, which makes sence and confirmed by the linear model. Tropical storm, hurricane and flood cause most of the property loss.

As we can see from the event year trend summary there is no significant increasing trend of total storm events in the past 17 years.